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Abstract:  Communication  requirements  of  Cholesky  factorization  of  dense  and  sparse 
symmetric,  positive  definite  matrices  ate  analyzed.  The  communication  requirement  is 
characterized  by  the  data  traffic  generated  on  multiprocessor  systems  with  local  and  shared 
memory.  Lower  bound  proofs  arc  given  to  show  that  when  the  load  is  uniformly  distributed 
the  data  traffic  associated  with  factoring  an  n  x  n  dense  matrix  using  n“,  a  <  2,  processors 
is  For  an  n  x  n  sparse  matrices  representing  a  s/n  x  i/n  regular  grid  graph 

the  data  traffic  is  shown  to  be  a  <  1. 

Partitioning  schemes  that  are  variations  of  block  assignment  scheme  are  described  and 
it  is  shown  that  the  data  traffic  generated  by  the.se  schemes  are  asymptotically  optimal. 
The  schemes  allow  efficient  use  of  up  to  0{n^)  processors  in  the  dense  case  and  up  to  0{n) 
processors  in  the  sparse  case  before  the  total  data  traffic  reaches  the  maximum  value  of 
O(n^)  and  respectively.  It  is  show'n  that  the  block  based  partitioning  schemes 

allow  a  better  utilization  of  the  data  accessed  from  shared  memory  and  thus  reduce  the 
data  traffic  than  those  based  on  column-wise  wrap  around  assignment  schemes.  -  .  , 
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1.  Introduction 


Consider  the  problem  of  stdving  a  system  of  linear  equations 

Ax  =  b 

where  A  is  an  n  x  n  symmetrie,  positive  definite  coefficient  matrix,  x  is  an  n  x  1  vector  of 
variables,  and  b  is  an  n  x  1  vector  of  constants.  Applying  the  Cholcsky  decomposition  to 
A  yields 

A  =  LL^ 

where  L  is  lower  triangular  with  positive  diagonal  elements  [lO].  From  this  factorization, 
the  solution  to  the  system  of  equations  is  obtained  by  .solving  the  triangular  systems 

Ly  =  h 

and 

i’x  =  y. 


Recently,  efforts  have  been  reported  for  efficiently  parallelizing  the  various  steps  in  comput¬ 
ing  the  solution  of  the  dense  and  sparse  systems.  Most  of  this  work  has  concentrated  on 
developing  algorithms  that  extract  as  much  parallelism  as  possible  on  specific  architectures 
[14,18,19,1,5,3,9].  The  main  emphasis  there  is  on  distributing  the  computational  load  as 
evenly  among  the  processors  as  possible  and  little  attention  is  paid  towards  the  data  traffic 
complexity. 


In  this  paper  we  are  interested  in  the  parallel  Cholesky  decomposition  schemes  with  mini¬ 
mum  data  traffic  for  factoring  dense  and  sparse  symmetric,  positive  definite  matrices.  The 
model  of  computation  assumed  for  this  ./urnose  is  that  of  a  multiprocessor  system  with  two 
level  memory  hierarchy  such  that  each  p  -“ssor  has  local  memory  and  all  processors  have 
access  to  a  common  shared  memory.  Accessing  any  nonzero  element  in  the  shared  memory 
is  assumed  to  generate  a  unit  data  traffic.  No  data  traffic  is  generated  in  accessing  the 
local  memory.  The  total  number  of  shared  memory  accesses  from  the  beginning  to  the  end 
of  the  algorithm  is  defined  as  the  communication  requirement  or  total  data  traffic  of  that 
algorithm  implemented  on  the  multiprocessor  system. 

In  [17]  the  communication  requirement  of  the  Gaussian  elimination  algorithm  implemented 
on  three  different  architectures  is  analyzed.  For  a  bus  architecture  where  a  data  element 
may  be  broadcast  to  all  the  processors  in  one  step  and  counts  as  one  unit  data  traffic 
independent  of  the  number  of  processors  receiving  the  data,  the  data  traffic  complexity  is 
shown  to  be  Q{n?).  For  a  nearest  neighbor  ting  network,  where  each  transmission  of  a  data 
element  across  a  link  of  the  ring  counts  as  one  unit  data  traffic,  the  data  traffic  complexity 
is  shown  to  be  -p),  where  p  is  the  number  of  processors  on  the  ring.  The  data  traffic 
complexity  for  a  nearest  neighbor  grid  network  is  shown  to  be  il{n^y/p).  In  all  the  cases 
it  is  assumed  that  no  element  is  computed  in  more  than  one  processor;  i.e.,  recomputation 
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is  nut  pormiftod.  By  using  a  difforont  proof  tochniquc  than  that  given  in  [17],  it  is  shown 
here  for  (ho  assumed  model  of  computation  that  the  data  traffic  complexity  of  the  Cholesky 
factoriration  scheme  is  proof  for  the  lower  bound  holds  oven  if  recomputation 

is  allowed  provided  each  processor  is  assigned  at  least  n^fCp  amount  of  work.  Although  we 
do  not  prove  it,  our  result  holds  for  other  variations  of  Gaussian  elimination  as  well. 

In  [4],  a  parallel  sparse  factorization  scheme  is  given  for  local  memory  multiprocessor  sys¬ 
tems.  This  scheme  has  a  total  data  traffic  of  0(71''*^'’ log2  n)  using  n"  processors.  This  result 
is  improved  to  in  [7].  In  this  paper  we  present  a  factorization  scheme  that  has  a 

total  data  traffic  of  0(7?' Our  main  results  and  the  organization  of  the  paper  is  as 
follows. 

In  the  following  section,  the  data  dependencies  involved  in  the  Cholesky  factorization  of  a 
dense  matrix  are  discussed  and  a  parallel  assignment  scheme  is  presented.  It  is  shown  that 
the  data  traffic  associated  with  that  scheme  is  U{n^  ■  yjj>)  when  an  77  x  77  dense  matrix  is 
factored  using  p  processors.  By  giving  a  proof  on  the  lower  bound  for  the  data  traffic,  in 
Section  2.4  it  is  shown  that  under  the  condition  of  uniform  load  distribution  the  computation 
time  and  data  traffic  complexities  of  the  assignment  scheme  are  asymptotically  optimal.  In 
Section  3.,  the  case  of  factoring  sparse,  symmetric,  positive  definite  matrices  is  considered. 
The  sparse  matrices  considered  here  arc  restricted  to  only  those  matrices  that  represent  the 
graphs  arising  in  finite  difference  and  finite  clement  applications.  In  Section  3.5,  a  block 
based  parallel  factoring  scheme  for  sparse  matrices  is  pre.sented.  The  data  traffic  in  factoring 
an  77  X  77  sparse  matrix  corresponding  to  a  2-dimcnsionaI  regular  grid  graph  is  shown  to  be 
■  \/v)-  In  Section  3.6  a  lower  bound  on  the  data  traffic  in  factoring  the  sparse  matrix 
is  shown  to  be  {l{n  ■  ^p).  These  results  can  be  extended  to  other  graphs  that  satisfy  an 
/(77)-scparator  theorem  [13].  Preliminary  versions  of  the  results  given  here  appear  in  [15] 
and  [16]. 

For  the  sake  of  clarity,  in  the  following  discussion  the  dense  matrix  is  assumed  to  be  of  size 
777  X  777  and  the  sparse  matrix  of  size  n  x  n. 


2.  Parallel  factoring  of  dense  symmetric,  positive  definite 
matrices 


The  ba.sic  algebraic  scheme  considered  here  for  factoring  an  m  x  777  symmetric,  positive 
definite  matrix  A  is  the  column  version  of  the  Cholesky  decomposition  method  [10].  An 
outline  of  this  algorithm  is  given  next  and  the  data  dependencies  arc  discussed.  Following 
that  a  partitioning  scheme  with  optimal  data  traffic  is  presented. 

In  the  following  discussion,  values  in  row  i  refer  to  the  values  of  the  elements  on  and  to  the 
left  of  the  diagonal.  Similarly,  a,,,  (a.j)  reptcsenis  all  the  elements  in  row  f  (in  column  j) 
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of  the  lower  triangular  part  of  the  matrix  under  consideration. 


2.1  The  Cholesky  factorization 

Let  .4  =  LL^  \  a,j  €  -4  and  6  L. 

for  j  =  1  until  m  do 
begin 

Initialize  /,  j  =  (i,j,  i  —  j,  -  ■  ■  ,m 

for  k  =  I  until  7  -  1  do 
for  i  =  j  until  m  do 

j  —  ^itj  ^i,tc  *  1 

~  \/^  i 

for  jfc  =  j  +  1  until  m  do 
end 


In  the  above  algorithm  for  clarity,  the  values  of  Uj  are  shown  separately  from  those  of  aij. 
In  practice  li  j  may  overwrite  a,,j. 

Clearly,  in  the  Cholesky  factorization  scheme  outlined  above,  computing  the  elements  in 
a  column  j  oi  L  requires  values  of  the  elements  in  columns  1  through  j  —  I  oi  L  and  the 
values  of  the  off-diagonal  elements  in  j  are  used  for  computations  of  elements  in  columns 
7  +  1  through  m  ol  L.  Specifically,  computing  an  element  Uj  in  L  requires  all  the  values 
from  the  set, 

=  {h,i'  1 1  <  /'  <  » {Ay  1 1  <  ;■'  <  ;■}  u  {a,-,;}. 

Moreover,  the  steps  of  the  innermost  loop,  where  a  product  of  two  elements  of  L  is  sub¬ 
tracted  from  a, ,7,  may  be  performed  in  any  order.  Once  is  computed,  it  is  used  in  the 
computation  of  every  element  in  the  set, 

=  {A,j'  I  i  < }'  <  *}  u  Vj'.i  !»<;'<”*}■ 

Again  the  element  /,j  may  be  used  in  any  order  in  the  computations  of  the  elements  in  the 
set  Aij. 
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2.2  A  partitioning  scheme  for  Cholesky  factorization 


\^’ifhout  loss  of  generality,  assume  that  p  —  (r^  +  r)/2  where  r  is  an  integer.  The  lower 
triangular  part  of  the  matrix  A  is  divided  into  p  partitions  by  taking  r  vertical  and  r 
horizontal  sections  each  of  size  s,  where  s  =  mfr.  All  except  r  of  the  resulting  p  partitions 
are  square  blocks  of  size  a  x  s.  The  remaining  r  partitions  which  lie  on  the  diagonal  of  the 
matrix  are  a  x  ^  triangular  blocks.  Each  of  the  partitions  is  assigned  to  a  single  processor. 
Initially,  each  processor  reads  the  data  for  its  partition  from  shared  memory  into  its  local 
memory.  The  cc>mputations  proceed  in  parallel  according  to  the  column  version  Cholesky 
algorithm  as  follows.  The  r  processors  in  charge  of  the  partitions  containing  the  left  most 
a  X  a  blocks  of  the  matrix  commence  the  computations  of  their  part  of  the  factorization. 
As  soon  as  an  elem.ent  of  the  factor  is  computed,  it  is  written  into  shared  memory  for  access 
by  other  processors.  As  the  necessary  data  becomes  available,  the  remaining  processors 
initiate  computations  on  the  blocks  assigned  to  them.  This  is  continued  until  the  entire 
factor  is  computed  and  written  into  the  shared  memory.  This  partitioning  and  factorization 
scheme  for  dcn.se  symmetric,  positive  definite  matrices  is  referred  to  as  the  block  oriented 
column  C/io/ejA:y  factorization  scheme  or  simply  as  the  BLOCC  scheme. 


Figure  1:  The  data  traffic  associated  with  block  I 


2.3  Data  traffic  complexity  of  the  BLOCC  scheme 


First  consider  the  data  traffic  associated  with  computing  the  elements  in  a  generic  square 
block  1  in  the  factor  L  shown  in  Figure  1.  In  that  figure,  the  darkened  area  represents  the 
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data  elements  that  are  required  for  the  computations  in  block  7.  Let  block  I  be  bounded  by 
the  columns  (?  -  I).'*  +  1  and  i  ■  .s,  and  In’  the  rows  {j  —  l).s  +  1  and  j  ■  f<  where  1  <  f  <  r  and 
1  <  ;  <  r.  The  following  lemma  provides  bounds  for  the  communication  cost  of  block  7. 


Lemma  1  A  total  data  traffic  of  (2i —  +  s/2  is  necessary  and  sufficient  for  computing 

the  elements  in  the  square  block  I;  it  is  the  same  for  all  square  blocks  bounded  by  the  columns 
(?  —  1).“!  +  1  and  i  ■  s.  The  data  traffic  associated  with  as  x  s  triangular  diagonal  block 
hounded  by  the  columns  (i  -  1).<?  +  1  and  i  ■  s,  is  {i  -  1/2)  •  +  s/2. 


Proof:  See  [15]. 
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Using  these  results,  a  bound  on  the  total  data  traffic  is  obtained,  as  shown  next. 


Theorem  1  The  total  data  traffic  associated  with  the  BLOCC  scheme  for  factoring  an 
m  X  m  dense  symmetric,  positive  definite  matrix  using  p  processors  ia  0{m.^ y/p). 


Proof :  The  total  data  traffic  associated  with  all  the  blocks  bounded  by  columns  (i  -  l)a  +  1 
and  t  •  1  <  i  <  r,  is  given  by, 

(r  —  i)  X  (data  traffic  associated  with  a  square  block) 

+  (data  traffic  associated  with  a  triangular  block  bounded  by  the  given  columns). 

From  Lemma  1  and  the  fact  that  there  are  r  such  column  partitions,  we  get  the  total  data 
traffic  involved  in  factoring  the  m  x  m  dense  matrix  using  the  BLOCC  scheme  as: 

^  ((r  -  i)((2i  -  1/2)  -  a"  +  a/2)  +  ((i  -  1/2)  •  a^  +  s/2)). 
i=l 

Ignoring  the  lower  order  terms,  the  total  data  traffic  is 

=  ^  (2r  .  i  -  2«2)  •  a"' 

i=l 

Since  p  =  (r^  +  r)/2  and  a  =  m/r,  the  total  data  traffic  is  0{m^y/p).  I 


2.4  A  lower  bound  on  the  data  traffic  complexity 


Theorem  i  gi'.  v;s  an  upper  bound  on  the  data  traffic  associated  with  factoring  the  m  X  m 
matrix  using  the  BLOCC  assignment  scheme.  In  the  next  theorem  a  lower  bound  on  the 
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data  traffic  in  factoring  the  dense  symmetric,  positive  definite  matrix  is  established.  Before 
giving  the  proof,  we  again  consider  the  data  dependencies  involved  in  computing  an  element 
of  the  factor.  Consider  the  compulations  at  any  element  aij  as  shown  in  Figure  2.  To 
compute  the  corresponding  clement  /.j,  values  at  all  the  elements  in  row  j  and  the  values 
of  elements  in  column  1  through  j  of  row  i  are  needed.  Thus,  if  o,j  is  an  off  diagonal 
element  then  2j  values  are  needed  for  the  computations  and  if  it  is  a  diagonal  element  (i.e., 
i  =  j)  then  j  values  are  required.  There  are  three  observations  to  make  regarding  these 
computati(.ins  as  follows: 


i.  The  values  at  all  the  elements  in  any  row  t  are  needed  to  complete  the  computations 
corresponding  to  the  diagonal  element  a,,,;  no  other  values  are  needed  for  computing 
Moreover,  no  other  element  in  the  factor  can  be  computed  by  knowing  only  the  values  in 
row  i 

ii.  If  i  and  j  are  any  two  rows  such  that  i  >  j,  then  the  values  of  the  elements  in  these  two 
rows  are  used  to  complete  computations  at  exactly  one  off-diagonal  element  a,j.  Values 
from  no  other  row  are  needed  to  complete  the  computations  at  that  element. 

iii.  For  the  computations  at  a  subset  of  elements  spread  over  kr  rows  and  kc  columns,  values 
from  at  least  max(it,,  itc)  rows  are  needed. 


Figure  2:  Data  dependencies  for  the  computations  at  o,,j  and  at  ele¬ 
ments  in  subset  5 

The  last  observation  follows  from  the  fact  that  the  computations  require  v»’n<»s  fcom  kr 
rows  as  well  as  from  the  kc  rows  that  correspond  to  the  kc  columns.  However,  some  of  the 
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nnvs  and  the  k,-  rows  may  overlap.  In  addition,  note  that  if  j'  is  the  leftmost  of  the  k^ 
(•ohiinns  then  at  least  all  the  values  in  rolnmns  1  through  j'  on  max(^r,  k^)  rows  are  retjuired 
in  the  computation  of  the  elements  in  the  subset  under  consideration  (see  Figure  2). 

1 

'  The  above  observations  and  the  result  established  in  the  following  lemma  are  used  to  get  a 

lower  bound  on  the  data  trafPlc  in  computing  the  Cholesky  factor. 


LeiTuna  2  Let  lb  he  the  aiiiount  of  computational  work  which  t.?  to  be  di.ftrthutrd  uniformly 
amovy  ]>  prorcssors  and  let  a  hr.  any  constant  Icxf  than  one.  For  any  unhsct  of  thin  compu¬ 
tation  connintinq  (if  11/2  amount  of  work,  there  are  at  leant  (1  —  o)  •  p/(2  -  o)  proemnorn 
each  anntyned  a  ■  W  I2]i  or  more  work  from  that  nuhnet. 


Proof:  The  work  is  uniformly  distributed  among  p  processors  and  hence  each  processor 
is  assigned  IT/p  amount  of  work.  Now  let  5  be  a  subset  consisting  of  n’/2  amount  of 
comj)uta tional  work.  All  p  processors  may  l)e  assigned  .some  portion  of  work  from  S.  Let 
?/',  he  the  eomjmta tional  work  from  S  a.ssignod  to  proressor  p,,  where  0  <  t/',  <  IT/p. 
Therefore, 


and  iy/2p  is  the  average  work  from  S  performed  by  each  processor.  Thus,  there  is  at  least 
one  processfir  that  i.s  assigned  iy/2p  or  more  work  from  5.  Let  q  be  a  constant  less  than 
one  and  suppose  that  r  processors  arc  assigned  at  least  a  ■  lT/2p  amount  of  work  from  S. 
Each  of  the  t  processors  may  be  assigned  at  most  IV/p  work  from  S.  Now  there  are  p  -  x 
i^rocessors  that  compute  less  than  a  •  ir/2p  amount  of  work  from  5.  Therefore, 


II 

P 


a  ■  IT’  , 

X  +  — 7 — -  •  (p  —  x)  > 


2p 


II 

2  ■ 


Sv'dving  the  inequality  for  x  we  get, 


1  -  a 
2^ 


■P- 


Thus,  there  are  at  least  (1  -  a)  •  p/(2  —  a)  processors  each  computing  a  •  n’/2p  or  more 
amount  of  work  from  5.  I 


In  the  following  theorem  a  bound  on  the  data  traffic  associated  with  computing  the  Cholesky 
factor  of  an  m  xm  matrix  is  established.  Note  that  the  result  holds  under  stronger  conditions 
than  required  by  the  model  of  computation  assumed  here. 


Theorem  2  Let  A  he  a  dense,  mx  m  .symmetric  positive  definite  matrix  that  resides  in  the 
eommon  memory.  If  the  computational  work  is  uniformly  distributed  among  p  processors, 
then  the  data  traffic  involved  in  computing  the  Cholesky  factor  of  A  is  Forp  >  6, 

the  data  traffic  is  Q{m^  ■  y/p)  even  if  the  initial  values  of  matrix  A  are  in  the  processor  local 
memory  before  the  computation  begins. 
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Figure  3:  Data  traffic  associated  with  factoring  of  elements  in  region 

FGH 

Proof:  Suppose  that  matrix  A  is  initially  stored  in  the  common  memory.  The  initial  value 
of  each  element  is  fetched  at  least  once  by  the  processors  for  computing  the  factor.  Thus, 
at  least  m^/2  amount  of  data  traffic  is  associated  with  the  Cholesky  factorization.  Hence 
when  the  number  of  processors  is  a  constant,  there  is  nothing  to  prove. 

In  the  following  it  is  proved  that  even  if  the  matrix  A  is  distributed  initially  among  the 
processors  according  to  their  work  assignment,  the  data  traffic  is  n(m*  •  when  the 
number  of  processors  is  greater  than  16(a  +  4/a  —  4)/3  for  any  constant  a  less  than  one. 
Since  there  exists  an  a  less  than  one  such  that  16(a +  4/a  —  4)/3  is  less  than  six,  the  proven 
result  holds  for  all  p  >  6.  Consider  the  computations  corresponding  to  the  elements  in  the 
set  5  =  {a,  j  I  i,j  >  mfl}.  In  Figure  3  the  region  FGH  denotes  this  set  of  elements.  The 
total  computational  work  in  factoring  the  m  x  m  matrix  is  m’/6  -f  m*/2  +  m/3  and  that 
corresponding  to  the  elements  in  set  5  is  m*/12  +  m*/4  +  m/6.  Thus,  the  amount  of  work 
associated  with  5  is  exactly  half  of  the  total  work.  If  a  is  a  constant  less  than  one,  then 
from  Lemma  2,  there  are  at  least  (1  -  a)  •  p/{2  -  a)  processors  each  computing  at  least 
o  •  m*/12p  amount  of  work  from  the  region  FGH. 

Let  n  be  the  set  of  processors  each  with  at  least  o-m®/12p  amount  of  work  from  the 
region  FGH  and  let  p,  €  H.  Now  the  computational  work  associated  with  any  element  in 
FGH  is  at  most  m  (work  corresponding  to  element  a„  ,n  )•  Therefore  at  least  a  •m^/12p 
elements  in  the  region  FGH  are  assigned  to  processor  p,.  Let  z  be  the  number  of  rows 
on  which  the  elements  assigned  to  processor  p,  lie.  This  implies  that  there  are  at  least 
f(a  •  m})l(\2  p  •  z)]  columns  on  which  the  elements  assigned  to  p,  lie.  Therefore,  &om  the 


()hsrr\  a t ii in  (iii)  atxivo,  data  frnin  at  loast  inax(r,  [(o  ■  ?w^)/(12  ■  jt  ■  7)])  nunibrr  i>f  rows  in 
th('  region  DFGE  are  reejnirod  for  cijniidolinR  tho  coinputalions  from  region  FGH  assigned 
to  processor  /),.  ^^’ithont  loss  of  generality,  assume  that  the  quantity  12  ■  p  ■  t  divides  a  •  rti^ 
ev(Mily.  .\o\v,  the  ejuantity  iiiax(3'.  (f>  •  ?t)^)/(12  ■  p  ■  r))  is  minimum  when  t  =  i/tl  •  ml^/ifp. 
Thus,  the  coinputations  in  processor  p,  reriuire  at  least  all  the  values  of  the  elements  on 
the  ydr  A7)/yT2p  rows  in  region  DFGE.  In  region  DFGE  each  row  has  m/2  elements  and 
thus  the  values  of  at  least  s/n  ■  ni^/{-i  ■  i/Sp)  elements  from  the  region  DFGE  are  needed 
in  I'rocessor  p,  for  performing  eomptit alions  in  the  region  FGH. 


Now  jiroressor  p,  may  also  he  assigned  some  work  from  the  region  DFGE  in  addition  to  that 
in  FGH .  Hence  to  complete  the  proof,  it  is  necessary  to  show  that  of  the  y/iu  •  m^/(4  ■ 
elements  needed  by  processor  p,  at  least  c  •  elements  are  not  available  locally,  where 

r  is  a  constant  less  than  one.  In  that  case  the  data  traflRc  associated  with  processor  p,  is 
at  least  c  •  m^/^p  and  since  there  are  at  least  (l-n:)-p/(2-Q)  such  processors,  the  total 
data  traffic  in  computing  the  Cholesky  factor  of  an  m  x  rn  dense  matrix  is  fl(m^  •  ^/p).  ^^’e 
complete  the  proof  by  showing  in  the  following  that  p,  accesses  at  least  c  ■  m"^ / yjp  non-local 
elements  from  region  DFGE  for  completing  the  computations  in  the  region  FGH . 


Processor  p,  is  assigned  at  least  n  m^/l2p  amount  t)f  work  from  the  region  FGH.  Since 
each  processor  is  assigned  m^/Gp  amount  of  work  (the  uniform  load  distribution  condition), 
p,  performs  at  most  (2  -  o)  •  /I2p  amount  of  work  in  the  region  DFGE.  The  data  trafRc 
associated  with  processor  p,  in  completing  tho  work  in  the  region  FGH  is  a  minimum 
when  all  the  elements  from  region  DFGE  assigned  to  p,  lie  on  the  \/q  ■  mj \J\2p  rows. 
Furthermore,  to  reduce  the  data  traffic,  as  many  elements  on  these  rows  as  possible  should 
be  assigned  to  processor  p,.  Now  the  computational  work  corresponding  to  any  clement 
is  that  is  tho  work  associated  with  an  clement  on  the  leftmost  column  of  the  matrix 
is  the  smallest  and  it  incxcascs  for  elements  on  any  row  from  left  to  right.  Therefore  tho 
data  traffic  associated  with  p,  is  a  minimum  when  it  is  also  assigned  the  computational 
work  corresponding  to  the  elements  in  the  leftmost  columns  on  the  chosen  rows  of  region 
DFGE.  Let  k  be  the  number  of  the  leftmost  columns  on  which  the  the  elements  from  region 
DFGE  that  are  assigned  to  p,  lie.  The  shaded  region  shown  in  Figure  3  corresponds  to  the 
elements  which  minimire  the  data  traffic  for  processor  p,.  Since  processor  p,  performs  at 
most  (2  -  o)  •  m^/\2p  amount  of  work  in  DFGE,  the  condition  on  k  is  given  by. 


y/a  ■  VI 


Li  < 


(2  -  q)  • 


i.e.,  _ 

,  1  I  4(2  -  a)  1 

k  <  -  Jl+  ^  ^  --  •  -  -. 

-  2y  y/a  v/3p  2 

It  can  be  verified  that  there  is  a  constant  fl  greater  than  one,  such  that  if  p  is  greater 
than  lG(o  -  4  +  4/o')/3,  then  the  right  hand  side  of  the  above  inequality  is  at  most 
777/2/9  f(7r  all  values  of  m.  This  gives  a  bound  on  k.  Therefore  work  corresponding  to 
at  most  {y/n  ■  7n^)/(4/9  •  y/^)  elements  in  the  region  DFGE  may  bo  assigned  to  processor 
p,  which  will  minimize  its  data  traffic  for  the  computation  in  the  region  FGH .  Hence  of 
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thr  (\M  •  n)^)/(2  •  \/l2p)  cloiiionts  needed  by  processor  p,  for  eompleting  the  computation 
in  the  region  FGH,  at  least  (1  -  1//?)  ■  \/n  ■  elements  are  not  available  locally. 

Thus,  if  the  number  of  proce.ssc>rs,  p.  is  greater  than  lG(rt  -  4  +  4/a)/2  for  any  q  less  than 
one.  then  the  data  trafUc  associated  with  processor  p,  is  at  h'ast  c  ■  I ^  for  some  con¬ 
stant  c  less  than  one.  Since  there  are  at  bast  (1  —  n)  pl[2  —  o)  such  processors,  the  result 
follows.  I 


2.5  Remarks  on  the  BLOCC  Scheme 


Assuming  that  each  step  of  the  innermost  loop  in  the  Cholesky  decomposition  costs  one 
computational  time  unit  and  ignoring  the  costs  associated  with  other  steps,  the  sequential 
computation  time  for  factoring  the  m  x  m  matrix  A  is  jG  A  0{rv^).  The  BLOCC  scheme 
described  above  has  a  computation  time  of  m'’/2p  +  0{m'^ jp),  where  p  is  the  number  of 
processors  used.  As  shown  in  Theorem  1  the  associated  data  traffic  is  less  than 
Thus,  the  time  and  the  data  traffic  complexities  of  the  BLOCC  scheme  are  optimum  in  an 
order  of  magnitude  sense.  However,  the  computational  load  in  the  BLOCC  assignment 
scheme  is  not  perfectly  balanced.  The  processors  that  compute  elements  in  the  partitions 
that  are  towards  the  left  side  of  the  matrix  L  finish  computation  earlier  than  those  that  are 
on  the  right.  This  balance  may  be  improved  in  several  different  ways,  but  at  the  cost  of 
increasing  the  data  traffic.  In  one  such  scheme  the  columns  of  the  matrix  are  assigned  to 
each  processor  in  a  wrap  around  fashion;  that  is,  columns  j,p  +  t, . . . ,  m  -  p  + »  are  assigned 
to  process(jr  i.  All  the  elements  on  any  column  of  L  are  computed  by  a  single  processor. 
Let  this  as.signment  scheme  be  referred  to  as  the  wrap  around  assignment  scheme.  In  this 
scheme  the  computation  is  distributed  more  evenly  among  the  processors  than  that  in  the 
BLOCC  scheme.  The  computation  time  is  reduced  to  m^/Gp  -f  0{m^),  provided  m  is  at 
least  p(p  +  3)/2.  However  the  data  traffic  associated  with  this  scheme  is  p/Z.  which 
is  suboptimal.  In  [ll]  and  [2]  the  wrap  around  assignment  scheme  is  recommended  as 
a  preferred  method  for  computing  the  factor  on  a  multiprocessor  system  because  of  its 
good  load  balancing  properties.  Their  analysis  does  not  take  into  account  the  cost  of  the 
associated  data  traffic,  which  must  be  taken  into  account  for  reducing  the  overall  execution 
time. 


Parallel  factoring  of  sparse,  symmetric  positive  definite 
matrices 


In  this  section  a  partitioning  and  assignment  scheme  is  presented  that  computes  the  factors 
of  an  n  x  n  matrix,  associated  with  a  2-d  regular  grid  graph,  using  n®,  o  <  1,  processors 
with  a  total  data  traffic  of  0(n' Then  it  is  shown  that  the  data  traffic  in  factoring  the 
matrix  is  when  the  load  is  distributed  uniformly  among  n”  processors,  a  <  1. 
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It  is  also  shown  that  in  any  scheme  that  requires  n"  processors,  «  >  1,  the  data  traffic  is 

f)(nV2). 


For  factoring  a  sparse  matrix  efficiently  proper  ordering  of  the  matrix  is  essential.  Ordering 
of  the  matrix  to  be  factored  also  determines  the  data  dependencies  and  hence  the  data  traffic 
associated  with  any  partitioning  and  assignment  scheme.  For  matrices  associated  with 
regular  grid  graphs,  nested  dissection  is  a  well  known  ordering  scheme  [6].  In  the  following 
a  few  basics  of  this  ordering  scheme  are  briefly  described  and  some  notation  is  introduced 
that  is  necessary  for  the  analysis  presented  later.  In  the  following  it  ’s  assumed  that  the 
reader  is  familiar  with  the  elementary  concepts  underlying  the  nested  dissection  algorithms, 
and  the  terms  such  as  elimination  order  and  the  fill  associated  with  the  elimination  process. 
It  is  also  assumed  that  the  reader  is  familiar  with  the  basic  graph  theory  concepts  related  to 
matrix  representations  of  systems  of  equations,  in  particular,  the  notion  of  vertices,  edges, 
separators,  subgraphs  of  a  graph,  and  the  correspondence  between  the  vertices  and  the 
rows  and  columns  of  the  matrix,  between  the  edges  and  and  the  non  zero  elements,  and  the 
added  edges  and  the  fill-in  during  the  factorization  of  the  matrix.  For  details  see  [12]  and 
[6]  and  the  references  therein. 


3.1  Nested  dissection  method  as  applied  to  2-d  grid  graphs 

A  nested  dissection  method  may  be  viewed  as  a  divide- and-conquei  algorithm  on  an  undi¬ 
rected  graph.  It  relics  on  finding  a  small  set  of  vertices,  called  the  separator  set,  in  the  graph 
such  that  the  removal  of  these  vertices  divides  the  graph  approximately  in  half.  Informally, 
the  nested  dissection  method  orders  the  vertices  of  the  graphs  as  follows.  The  vertices  in 
the  separator  set  are  ordered  last.  Then  the  vertices  in  the  subgraphs  obtained  horn  the 
original  graph  by  removing  the  separator  are  ordered  recursively.  In  [12]  a  nested  dissection 
algorithm  is  given  for  ordering  the  vertices  of  any  graph  G  such  that  G  and  all  subgraphs 
of  G  satisfy  a  ^/n-scparatoI  theorem.  The  ordering  produced  by  this  algorithm  guarantees 
a  C)(n  log  n)  fill  and  sequential  operation  count  for  a  system  corresponding  to  an 

n-vertex  graph  G.  In  [8]  a  nested  dissection  algorithm  is  given  for  ordering  the  vertices  of 
a  graph  G  that  has  a  y/n-separator  decomposition.*  For  a  detailed  treatment  of  the  nested 
dissection  methods  and  for  the  relevant  practical  applications  see  [6]. 

For  the  sake  of  simplicity  and  clarity,  here  only  the  systems  corresponding  to  \/n  x  ^/n 
regular  grid  graphs  are  analyzed.  However,  the  techniques  developed  for  analyzing  data 
traffic  complexities  are  applicable  to  other  systems  where  the  nested  dissection  method  can 
be  used  to  give  a  “good”  ordering.  In  the  following,  the  nested  dissection  method  used 
for  ordering  the  vortices  in  a  y/n  x  y/n  regular  grid  graph  is  briefly  described.  In  the 
discussion,  the  grid  graph  is  sometimes  simply  referred  to  as  the  grid  and  a  subgraph  of  the 
grid  graphs  is  referred  to  as  a  subgrid.  For  the  rest  of  the  discussion,  it  is  also  assumed  that 

"A  graph  (7  is  said  (n  have  a  y/n-sepnrator  decomposition  for  constants  a  <  I  and  0  >  0  if  G  has  a 
v/n  separator  C  and  every  connected  component  of  G'—  C  has  a  y/ii-separator  decomposition. 
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Figure  4:  A  7  x  7  grid  with  nested  dissection  ordering 

the  vertices  of  the  grid  are  connected  according  to  a  9-point  stencil,  unless  otherwise  stated. 
Let  V  be  the  set  of  vertices  of  a  v/w  x  y/n  regular  grid  graph.  Without  loss  of  generality, 
assume  that  y/n  =  2*  -  1  for  some  integer  1.  Let  So  be  the  set  o^  2*  —  1  vertices  on  a  vertical 
mesh  line,  the  removal  of  which  partitions  V  into  two  subgrids,  V\  and  such  that  the 
vertices  of  both  the  subgrids  are  arranged  in  a  (2*  —  1)  x  (2*“^  —  1)  mesh.  The  vertices  of 
So  are  numbered  from  n  -  y/n  -I-  1  to  n  in  any  order.  Suppose  that  V’l  is  the  left  subgrid 
and  V2  is  the  right  subgrid.  Let  Si  be  the  set  of  vertices  on  a  horizontal  mesh  line  that 
divides  Li  into  two  equal  parts  each  containing  (2*~*  —  1)^  vertices  that  are  arranged  along 
a  (2*"*  -  1)  X  (2*”'  -  1)  square  mesh.  SimUarly,  let  S2  be  the  set  of  vertices  &om  V2  which, 
when  removed,  produce  two  equal  halves  from  V2.  Both  Si  and  S2  contain  2*“*  —  1  vertices. 
Let  the  vertices  in  Si  be  numbered  from  n  —  2y/n  -f  2  to  n  —  Zy/nfl  -t- 1/2  and  those  in  S2 
be  numbered  from  n  —  3y^/2  +  3/2  to  n  —  y/n.  Thus,  the  removal  &om  V  of  the  vertices 
in  the  set  SqU-^i  partitions  V  into  four  {■^n  —  l)/2  x  {y/n  —  l)/2  snbgrids.  The 
separator  set  SqU-S^i  is  referred  to  as  the  -separator  ioi  the  grid  corresponding  to 
V.  The  middle  vertical  part  of  the  “-l-”-separator  is  referred  to  as  the  vertical  suh-separator 
and  each  of  the  two  horizontal  halves  of  the  “+”-separator  is  referred  to  as  the  horizontal 
sub- separator.  All  the  vertices  of  the  four  subgrids  are  numbered  by  recursively  identifying 
and  ordering  the  vertices  on  the  “-|-”-8eparators  of  the  subgrids  induced  by  the  vertices 
ordered  so  far.  The  recursion  stops  when  a  subgrid  has  only  one  vertex  on  it.  For  any 
“-(-"-separator,  there  is  a  vertical  sub-separator  and  two  horizontal  sub-separators.  With 
the  above  described  ordering  scheme,  for  any  given  “-(-"-separator,  the  vertices  on  the  two 
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hori7nntal  suh-soparators  arc  given  numbers  that  are  smaller  than  those  assigned  to  the 
vertires  on  the  corresponding  vertical  sub-separator.  Thus,  we  say  that  the  vertices  on 
a  horizontal  sub-separator  are  ordered  ahead  of  the  vertices  on  the  corresponding  vertical 
sub-separator  or  that  the  vertices  on  the  vertical  sub-separator  are  ordered  after  those  on 
the  horizontal  sub-separators.  An  example  of  ordering  the  vertices  in  a  7  x  7  grid  is  shown 
in  Figure  4.  Observe  that  the  grid  is  recursively  partitioned  into  four  subgrids  by  a  set  of 
vertices  that  form  a  ‘■4-'‘-separator. 

To  label  the  subgrids  and  the  separators  of  the  grid  graph,  we  use  the  notation  given  in 
[7].  Each  subgraph  and  the  separator  that  induces  the  subgraph  are  given  a  level  number 
depending  on  the  recursion  level  of  the  nested  di.s.section  on  which  the  subgraph  is  ordered. 
Under  this  scheme  the  original  grid  is  called  a  level-0  (sub)grid.  The  four  subgrids  of  size 
(V^-l)/2  X  (v^-l)/2  are  the  level-1  .subgrids.  The  “-t-”-separator  that  partitions  the 
level-()  grid  into  the  four  level-1  subgrids  is  called  the  level-1  “-|-”-separator  or  simply  as  the 
level-1  separator.  Thus,  if  n  is  equal  to  (2*  ~  1)^,  there  are  I  levels  of  subgrids  numbered  0 
through  I  —  1  and  /  —  1  levels  of  separators,  numbered  1  through  /  —  1. 

In  the  following  it  is  assumed  that  the  matrix  to  be  factored  is  ordered  using  the  nested 
dissection  scheme  and  that  the  symbolic  factorization  step  is  already  completed. 


3.2  Cholesky  factorization  scheme  revisited 

Consider  the  Cholesky  factorization  scheme  described  in  Section  2.1  for  factoring  a  sparse 
symmetric,  po.sitive  definite  matrix. 

Clearly,  the  main  difference  between  factoring  a  sparse  and  a  dense  matrix  using  the 
Cholesky  factorization  scheme  is  that  in  the  former  case  there  is  no  need  to  modify  col¬ 
umn  j  by  all  columns  to  the  left  of  it.  Specifically,  column  j  is  modified  only  by  columns 
k  for  which  / j  *  ^  0.  Moreover,  if  column  k  modifies  column  ;,  only  the  nonzero  elements 
of  column  k  need  to  be  fetched.  Exactly  which  elements  are  needed  is  formalized  later.  In 
Figure  5(a),  the  zero-nonzero  structure  of  L,  corresponding  to  the  vertices  of  the  separators 
on  the  first  two  levels,  is  shown  schematically.  The  shaded  areas  represent  the  nonzeros. 
The  corresponding  grid  is  shown  in  Figure  5(b).  It  is  clear  from  the  figure  that  only  certain 
values  from  certain  columns  are  needed  for  computing  an  element  of  the  factor. 

Another  important  difference  is  that,  because  of  the  ordering  applied,  several  columns  may 
be  computed  .simultaneously.  As  stated  earlier,  column  i  and  row  j  of  the  matrix  corresponds 
to  a  vertex  n,  in  the  elimination  graph  and  the  factoring  of  the  matrix  corresponds  to  the 
elimination  of  the  vertices.  Thus,  all  the  vertices  on  the  level  f—1  subgrids  may  be  eliminated 
simultaneously  followed  by  those  on  the  level  1—2  and  so  on.  This  observation  is  useful  in 
extracting  parallelism  in  the  factorization  step. 
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Figuie  5:  Structure  of  L 

3.3  The  worst  case  data  tralffic  complexity 

In  this  section  a  bound  on  the  worst  case  data  traffic  complexity  for  factoring  the  matrix  A 
is  established.  Clearly,  the  communication  requirement  is  the  worst  when  the  use  of  local 
memory  is  not  allowed.  Thus,  an  upper  bound  on  the  worst  case  data  traffic  is  obtained 
by  assuming  that  the  values  of  all  the  elements  of  the  lower  triangular  part  of  matrix 
A  and  those  of  L,  as  well  as  any  intermediate  results,  are  stored  in  the  shared  memory. 
Suppose  also  that  any  number  of  processors  ate  allowed  to  participate  in  computing  a 
nonzero  element  of  the  factor  provided  that  no  computation  is  repeated.  Consider  the 
computations  associated  with  a  nonzero  element  Uj  €  L.  Recall  that  in  computing  /,  j,  first 
a,j  - j  h,k-fj,k  is  evaluated  and  then  the  resulting  value  is  divided  by  Thus,  for  each 
multiplication,  there  is  one  subtraction  operation,  at  most  one  division  and  three  memory 
references  and  a  constant  overhead  such  as  index  computation.  Therefore,  in  the  worst  case, 
each  multiplication  operation  in  the  Cholesky  factorization  is  associated  with  a  constant 
amount  of  data  traffic.  The  following  theorem  gives  a  bound  on  the  worst  case  total  data 
traffic.  In  the  proof  of  the  theorem,  the  result  given  in  Theorem  8.1.8  of  [6]  is  assumed. 
That  theorem  states  that  the  number  of  operations  requited  to  factor  a  matrix  associated 
with  an  n-vertex  2-D  grid  ordered  by  nested  dissection  is  given  by  829n*/®/84  +  0(ii-logn). 
Although  the  following  result  is  obvious,  it  is  useful  because  it  is  independent  of  the  number 
of  processors  used  and  it  gives  the  worst  case  bound  on  the  data  traffic  even  for  the  models 
of  computation  that  arc  more  restrictive. 
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Theorem  3  The  worst  case  data  traffic  associated  with  factoring  the  matrix  A  is 


Proof:  Assiiriatod  with  each  multiplication  operation  in  the  factorization  there  are  at  most 
a  constant  number  of  memory  references.  Suppose  that  k  memory  references  are  involved 
per  multiplication.  Thus,  the  total  data  traffic  is 

<  k  ■  number  of  multiplication  operations. 

Now,  the  number  of  multiplication  operations  associated  with  factoring  matrix  A  is 

[6].  Hence,  the  worst  case  total  data  traffic  is  I 

Note  that  the  above  theorem  is  applicable  to  all  the  graphs  for  which  a  v^-sep^rator 
theorem  holds. 


3.4  Data  dependencies  for  the  sparse  Cholesky  factorization 


The  worst  case  bound  on  the  data  traffic  established  in  Theorem  3  can  be  improved  for  the 
model  of  architecture  assumed  in  the  case  of  the  dense  matrices.  In  that  model,  no  element 
is  fetched  more  than  once  from  the  shared  memory  and  hence  the  values  of  the  elements  used 
in  more  than  one  operation  ate  stored  in  the  local  memory  associated  with  the  processor. 
To  maximize  the  potential  of  such  a  model,  it  is  necessary  to  clearly  understand  the  data 
dependencies  involved.  The  vertices  of  the  grid  are  ordered  using  the  recursive  nested 
dissection  scheme.  Hence  it  is  sufficient  to  investigate  the  data  dependencies  involved  in 
computing  the  elements  of  L  in  the  columns  corresponding  to  the  vertices  in  a  generic 
“+”-separator.  This  is  accomplished  in  the  next  two  lemmas. 

Let  rf-  =  {it|A:  <  j  and  ^  0,/,,*  €  i};  i.e.,  rf-  is  the  set  of  all  columns  of  the  factor  L  to 
the  left  of  the  column  j  +  1  such  that  the  elements  in  row  i  of  these  columns  are  nonzero. 
Let  ijj vi',  i  c.,  Vi  ^  is  the  set  of  all  the  columns  to  left  of  column  ; '  +  1  such  that  on 
each  of  these  columns  there  is  a  nonzero  element  in  at  least  one  of  the  i  through  k  rows  of 
the  factor.  Let  L  represent  any  m-vertex  sub-separator.  It  is  assumed  that  all  the  vertices 
in  any  sub-separator  are  ordered  consecutively.  Let  /o7/;(r)  and  high{T)  be  the  indices  of 
the  lowest  and  the  highest  ordered  vertices,  respectively,  on  the  sub-separator  T.  Note  that 
high(T)  -  low{T)  -I-  1  =  m.  In  Figure  6,  a  sub-separator  T  is  shown.  This  sub-separator 
separates  the  vertices  in  regions  Ri  and  i?2-  The  diagonal  and  off-diagonal  non-zero  blocks 
associated  with  this  sub-separator  are  shown  in  Figure  7. 

The  following  lemma  establishes  some  basic  sub-separator  related  properties  that  are  useful 
in  analyzing  the  communication  requirements. 

Lemma  3  Lr.tT  be.  any  m-vertex  sub-se.parator.  (i)  Corresponding  to  the  vertices  ofT  there 
is  a  dense  m  x  m  triangular  diagonal  block  in  the  Cholesky  factor,  (ii)  In  the  factor  L,  the 
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Figure  6:  Sub-separator  F  with  four  surrounding  sub-separators 

columns  low{T)  through  high{T)  contain  at  most  four  off-diagonal  rectangular  blocks  with 
nonzero  elements.  Each  of  these  blocks  is  of  size  o<  most  (ci  •  m  -f  C2)  x  m  where  Ci  <  2 
and  cj  <  3  are  positive  integer  constants.  Any  nonzero  element  in  these  columns  is  either 
in  one  of  these  four  blocks  or  in  the  diagonal  triangular  block. 


Proof:  The  first  part  of  the  lemma  is  obvious.  In  Figure  6,  the  sub-separator  F  separates 
the  vertices  in  regions  Ri  and  Since  the  vertices  in  these  two  regions  are  ordered  ahead 
of  those  of  r,  the  fill  due  to  the  elimination  of  vertices  in  regions  Ri  and  J?2  ensures  a  dense 
m  X  m  triangular  diagonal  block  bounded  by  columns  low{T)  and  high{T)  as  shown  in 
Figure  7. 

To  prove  the  second  part  of  the  lemma,  again  consider  Figure  6.  In  that  figure,  the  thickness 
of  the  lines  qualitatively  indicates  the  separator  levels  in  the  nested  dissection  ordering.  Let 
Ti,  r2,  Fa,  and  F4  be  the  four  partial  sub-separators  that  surround  the  sub-separator  F. 
Because  of  the  nature  of  the  nested  dissection  ordering,  the  vertices  of  F  are  “connected”  to 
only  those  higher  ordered  vertices  that  lie  on  Fi,  F2,  Fj,  and  F4  and  to  no  other  vertices.^ 
Thus,  all  the  nonzeros  on  columns  /o«)(F)  through  high{T)  in  rows  below  high{T)  are 
confined  to  only  the  rows  corresponding  to  the  vertices  on  Fi,  F2,  Fs,  and  F4.  Furthermore, 

^Vertex  «  is  »id  to  be  “connected”  to  vertex  «  if  there  exist!  a  path  [a,«i, as, «]  of  length  one 
or  more  in  the  grid  graph  such  that  indcx(u,)  <  inin{fn«fez(a),iiider(«)),  for  1  <  r  <  k;  in  each  a  case, 
li,j  L\»  h  non-zero,  where  i  =  inJea:(u),j  =  indexfo). 
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each  vertex  in  F  is  “connected”  to  every  vertex  on  these  four  partial  sub-separators  and 
hence  the  four  rectangular  blocks  are  dense.  This  is  shown  schematically  in  Figure  7.  It 
can  be  verified  that  if  F  is  a  horizontal  m-vertex  sub-separator,  then  the  surrounding  box  of 
vertices  is  of  dimension  {2m +  5)  x  (m-t-2).  Therefore  there  are  two  rectangular  off-diagonal 
dense  blocks  of  dimension  at  most  (2m  -I-  3)  x  m  and  the  other  two  of  dimension  at  most 
(m-t-2)  X  m.  Similarly,  if  r  is  a  vertical  m-vertex  sub-separator,  there  are  four  off-diagonal 

»  rectangular  blocks  of  dimension  at  most  (m  -f  2)  x  m  in  the  factor.  If  F  is  not  surrounded 

on  all  four  sides  then  some  of  these  blocks  will  be  missing.  I 

I 


Figure  7:  Off-diagonal  blocks  with  nonzeros  corresponding  to 
sub-separator  F 


From  the  above  lemma  it  is  clear  that,  in  computing  the  nonzero  elements  in  the  columns 
corresponding  to  the  vertices  on  a  sub-separator,  only  the  data  dependencies  of  the  elements 
in  the  four  rectangular  blocks  and  the  diagonal  triangular  block  need  be  considered.  This 
is  accomplished  in  the  following  lemma  where  a  bound  is  derived  on  the  amount  of  data 
required  in  computing  the  nonzero  elements  lying  on  a  given  row  and  on  one  of  the  five 
blocks.  The  lemma  shows  that  the  number  of  nonzero  elements  in  any  row  i  of  the  factor 
L  is  less  than  c  •  m  where  c  is  an  integer  constant  and  m  is  the  size  of  llic  sub-separator 
to  which  the  vertex  corresponding  to  row  i  belongs.  It  is  then  shown  that,  for  any  row  i, 
the  computations  at  all  the  elements  /jj  c  L,  low{T)  <  J  <  high{T),  for  some  m-vertex 
sub-separator  F,  require  a  total  of  less  than  c-  m  nonzero  elements  from  that  row.  Note 
that  this  count  is  independent  of  the  sub-separator  to  which  the  vertex  corresponding  to 
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row  i  belongs.  Thus,  the  computations  at  aJi  the  elements  in  a  row  of  any  of  the  five  blocks 
specified  in  Lemma  3  require  only  c  m  elements  from  that  row,  irrespective  of  the  relative 
location  of  the  off-diagonal  blocks  in  the  factor. 


Lemma  4  Le.t  L  he  any  m-vertex  sub-separator.  The  nonzero  elements  from  row  i,  i  > 
lnw[T),  required  in  completing  the  computations  of  all  elements  lij  €  L  such  that  low{T)  < 
j  <  bigb{T),  are  those  elements  in  row  t  on  the  columns  in  the  set  given  by, 

For  all  i  greater  than  or  equal  to  low{T),  )  Pi  ^||  »•’  cit  most  c-m  for  some 

constant  c. 


Proof:  Any  nonzero  clement  I,  j  E  L,  i  >  lou'(r)  and  lou'(r)  <  j  <  high{T),  is  in  one  of  the 
five  blocks  specified  in  Lemma  3.  Hence,  to  prove  the  result  of  this  lemma,  only  the  rows 
that  intersect  one  of  these  blocks  need  to  be  considered.  The  result  for  low{T)  <  i  <  high{T) 
is  proved  first  followed  by  that  for  i  >  h.igb{T). 


When  lowiT)  <  i  <  high{T),  H 


•  high(\')  _ 

since,  7/,  '  ^  C  V,„,,^r)Mgh(rr 


By  definition,  the  set  contains  all  the  columns  that  have  a  nonzero  element  in  row 

i.  Clearly,  the  nonzero  elements  from  the  row  i  required  in  completing  the  computations  at 
all  the  elements  lij  €  L,  /ou'(r)  <  j  <  higk{T),  are  on  columns  in  the  set 


To  measure  the  size  of  the  set  note  that  it  is  bounded  by  the  number  of  vertices 

ordered  ahead  of  the  vertex  i  and  which  are  “connected”  to  vertex  i.  Using  the  recursive 
nature  of  the  nested  dissection  ordering  it  can  be  verified  that  in  the  case  of  constant  degree 
grid  graphs  and  when  /ou/(r)  <  i  <  high{T),  the  size  of  the  set  is  bounded  by  c-m, 

w'here  c  is  a  constant  dependent  both  on  the  degree  of  the  graph  and  on  whether  T  is  a 
horizontal  or  vertical  sub-separator.  If  T  is  a  horizontal  m-vertex  sub-separator  then,  for 
a  5-point  stencil,  c  is  equal  to  7  and,  for  a.  9-point  stencil,  c  is  equal  to  11.  When  F  is  a 
vertical  m-vertex  sub-separator,  the  values  of  c  are  5  and  7,  respectively.  This  completes 
the  proof  when  low(r)  <  i  <  high{r). 

The  case  where  i  >  high{T)  is  considered  next.  As  shown  above,  depends  on  the 

size  of  the  sub-separator  to  which  the  vertex  i  belongs  and  hence,  when  i  >  high{T), 
||^hig/i(r)||  much  greater  than  0{m)  where  m  is  size  of  F.  However,  when  the 

computa  tion  of  only  those  elements  in  row  i  that  lie  on  columns  low{T)  through  high{T)  are 
of  concern,  each  of  these  computations  consists  of  a  product  of  a  nonzero  element  in  row  t 
and  a  nonzero  element  in  one  of  the  rows  low{T)  through  high{T)  in  the  column  high{T)  or 
in  some  other  column  to  the  left  of  it.  Thus,  for  these  computations,  only  the  columns  that 
have  a  nonzero  element  in  row  i  and  in  row  j,  where  low{T)  <  j  <  high{r),  are  of  interest. 
The  set  ^^^nsists  of  all  columns  that  have  a  nonzero  element  in  at  least  one 

of  the  rows  low{T)  through  high{T).  Similarly,  consists  of  all  the  columns  that 

have  a  nonzero  element  in  tow  i.  Clearly,  the  set  H  consists  of  all  the 
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columns  which  contain  ail  the  pairs  of  nonzero  elements  that  must  be  used  in  completing  the 
computations  at  all  the  elements  /, j,  /oT/'(r)  <  j  <  hiphlT).  Thus,  the  nonzero  elements 
from  row  i,  i  >  high{T),  required  in  completing  computations  at  all  the  elements  /,  j  €  L 
such  that  /oi/'(r)  <  j  <  high{T),  are  those  elements  in  the  row  i  on  the  columns  in  the  set 
given  by  V, D  r/.  '  ^ 


To  get  a  bound  on  the  size  of  ^ the  m-vertex  horizontal 

sub  separator  F  shown  in  Figure  G.  It  is  surrounded  by  sub-separators  Fi,  F2,  F3,  and  F4. 
Suppose  that  /o?/'(F|)  <  i  <  hjgh{Ti).  The  set  )  fl  ^  consists  of  columns 

corresjionding  to  vertices  on  F  or  corresponding  to  those  vertices  ordered  ahead  of  them 
which  are  “connected"  to  at  least  one  vertex  in  F  and  to  the  vertex  corresponding  to  row  i. 
Using  the  rcciirsivo  nT'’^'ring  of  the  nested  dissection  scheme  it  can  be  shown  that  the  number 
of  such  vertices  is  less  than  7m.  Thus,  ||^fou’(rV/ii<7A(r)  /om.'(Fi)  <  i  < 

high{T\).  The  same  bound  is  obtained  when  /o«,'(F4)  <  i  <  high{Ti).  If  /om(F2)  <  i  < 
high{T:2)  or  /o7r(F3)  <  i  <  highiTz)  then  it  can  be  verified  that,  ^ - 

3m.  If  F  is  vertical  sub-separator  the  two  bounds  are  5m  and  5m/2  respectively.  I 


3.5  A  partitioning  scheme  with  minimum  data  traffic 

In  this  section  a  partitioning  scheme  for  computing  the  factor  of  the  sparse  matrix  A  is 
described.  Suppose  that  an  n  x  n  matrix  is  to  be  factored  using  n“  processors,  a  <  1. 
The  vertices  of  the  y/n  x  y/n  grid  graph  corresponding  to  this  matrix  are  ordered  using 
the  nested  dissection  method  described  earlier.  Assuming  n  =  (2*  —  1)^,  the  ordering  results 
in  /  levels  of  subgraphs  and  /  -  1  levels  of  “-|-”-separators.  If  the  original  y/n  x  y/ii  grid 
is  considered  to  be  on  level  0,  then  on  level  i  there  are  2^'  level-j  subgraphs  each  of  size 
(2'-'  -  1)  X  (2'“'  -  1).  Without  loss  of  generality  assume  that  a  •  /  is  an  integer.  Thus,  in 
the  partitioning  scheme  described  here,  all  the  vertices  on  a  level-of  subgraph  are  assigned 
to  the  same  processor.  In  that  scheme,  initially  each  processor  independently  computes  the 
elements  in  the  factor  corresponding  to  a  -  1)  x  —  1)  subgraph  which  are 

separated  from  one  another  by  the  level-a/  separators.  Once  the  elements  in  the  columns 
corresponding  to  the  vertices  on  the  level-(/  —  1)  through  level-a/  separators  are  computed 
locally,  a  processor  p,  combines  with  three  other  processors  to  compute  the  elements  on  the 
columns  of  L  corresponding  to  the  vertices  on  the  level-(a/  -  1)  “-f ’’-separator.  The  two 
horizontal  sub-separators  are  computed  by  two  processors  and  the  vertical  sub-separator  of 
that  level  is  computed  by  all  four  processors.  The  next  lower  level  “-f ’’-separator  is  computed 
in  parallel  by  sixteen  processors  from  the  four  neighboring  groups.  This  is  continued  until  all 
the  vertices  are  elimina  ted.  On  each  level  of  computation  each  group  of  processors  computes 
the  elements  of  the  factor  independent  of  the  pthcr  groups.  The  elimination  of  the  vertices 
on  the  vertical  sub-separator  of  level- 1  is  computed  in  parallel  by  all  processors.  This 
corresponds  to  factoring  a  y/n  x  y/n  dense  matrix.  The  computations  corresponding  to  the 
level-i  separator,  f  <  a  •/,  are  performed  as  follows.  The  computations  corresponding  to  the 
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vertices  on  level-(a/  -  k)  “+'’-separator,  1  <  k  <  a  I,  are  completed  by  p  =  2^  processors 
Working  in  parallel.  Using  all  the  available  processors,  the  factorization  corresponding  to 
the  7)!  X  71)  triangular  diagonal  bkick  is  first  completed.  Then  the  processors  are  used  to 
(•('inimte  the  elements  corresponding  to  the  four  off-diagonal  blocks.  For  the  first  part,  the 
BLOCC  factorization  scheme  described  for  the  dense  matrices  is  used.  The  m  x  m  dcn.se 
diagonal  block  is  partitioned  into  r^/2  -  r/2  square  blocks  and  r  diagonal  triangular  blocks 
each  of  size  in / r  x  m/r  where  p  =  r^/2  -f  r/2,  and  each  of  these  p  partitions  is  assigned  to  a 
unique  processor.  Each  processor  completes  the  compntatit)ns  corresponding  to  its  partition 
by  accessing  the  required  data  from  the  shared  memory.  For  the  purpose  of  factoring,  the 
off-diagonal  blocks  are  treated  as  if  they  were  adjacent,  and  the  resultant  rectangular  block 
is  partitioned  into  p  sub-blocks  each  of  size  c  ■  ml x  m  j yjp,  where  c  <  6  for  a  horizt)ntal 
snh-sef)ara tr)r  and  e  <  4  for  a  vertical  sub-separator.  Again  each  partition  is  assigned  to  a 
sei)arate  i)rocessor.  This  process  is  repeated  on  the  next  lower  level  “+”-separator.  Thus, 
in  the  assignment  scheme  described  here,  each  processor  is  assigned  a  new  subblock  on  each 
level  and  the  size  of  the  sul)block  assigned  to  a  processor  varies  from  one  level  to  the  next. 
Let  this  partitioning  scheme  be  referred  to  as  the  sparse  block  oriented  column  Cholesky 
factorization  scheme  or  simply  as  the  sparse  BLOCC  scheme.  Note  that  the  underlying 
numeric  algorithm  is  the  column  oriented  Cholesky  factorization. 


Data  traffic  associated  with  an  m-vertex  sub-separator 


The  sparse  BLOCC  scheme,  described  above,  may  be  considered  as  a  sequence  of  steps,  each 
step  corresponding  to  the  elimination  of  vertices  on  the  “-|-”-scpara.tors  of  some  level.  Ini¬ 
tially,  a  single  processor  computes  all  the  non-zero  elements  corresponding  to  a  “-|-”-separator 
in  the  factor.  As  the  computation  proceeds,  more  than  one  processor  w’ork  together  to  com¬ 
pute  the  elements  corresponding  to  a  “-|-”-separator.  On  any  such  step,  first  the  non-zero 
elements  in  the  columns  corresponding  to  the  horizontal  sub-separators  are  computed  and 
then  tho.se  in  the  columns  corresponding  to  the  vertical  part  are  computed.  Here  we  analyze 
the  data  traffic  ass<!ciafed  with  any  one  step,  on  which  p  processors  combine  together  to 
compute  the  elements  corresponding  to  a  sub-separatof 

By  Lemma  3,  for  any  sub-separator  F  there  ate  at  most  five  non-zero  blocks  in  the  columns 
corresponding  to  the  vertices  on  F.  The  number  of  non  zero  blocks  is  five  when  F  is  enclosed 
within  a  rectangular  box  formed  by  the  sub-separators  with  vertices  that  are  ordered  after 
those  on  F  (see  Figure  6).  The  following  lemma  gives  a  bound  on  the  data  traffic  associated 
with  computing  the  elements  in  the  columns  corresponding  to  such  sub-separators.  Not  all 
sub-separators  are  enclosed  by  such  rectangular  boxes.  In  such  cases  there  are  less  elements 
to  be  computed  and  consequently  there  is  less  data  traffic.  For  the  sake  of  simplicity  of  the 
analysis,  it  is  assumed  that  no  element  of  the  factor  needed  in  the  comnutation  of  the  five 
non-zero  blocks  is  initially  in  the  local  memory  of  any  of  the  p  processors.  Thus,  the  data 
traffic  given  below  is  a  conservative  estimate. 
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Lemma  5  Let  T  hr.  any  w-verte.T  fiih-nepaTator  and  p  he  the  number  nf  prncessorx  avaxlahle 
Jot  computing  the  elements  nf  the.  factor  in  all  the  non-zero  hlor.k.i  within  the  eolumn.t  loir{T) 
through  fngh{r].  If  F  is  an  m-verter  horizontal  suh- separator ,  then  the  associated  data 
traffic  i,*  at  most  (53  +  llv/2)n!^  ■  If  it  is  a  vertical  sub-separator,  then  the.  data  traffic 
t,«  at  most  (28  +  8v/2)r»i^  • 


Proof:  Let  F  bo  an  w  vertex  horizontal  sub-separator  that  is  enclosed  completely  within 
a  rectangular  box  formed  by  the  sub-separators  whose  vertices  are  eliminated  after  the 

(  vertices  of  F.  Such  a  sub  sejiarator  has  the  worst  case  communication  requirements  among 

all  the  T7)-vertex  sub-sejiarators. 

First,  consider  the  data  traiTic  associated  with  computing  the  elements  of  the  factor  in  the 
triangular  diagonal  block  using  p  =  r^/2  +  r/2  processors.  Each  of  the  sub-blocks  requires 
nonzero  elements  from  at  most  7m f  r  renvs  <^ut  of  the  m  rows  in  the  range  lou'(r)  through 
high{T)  of  the  factor.  No  other  information  is  neede<l.  Frejm  the  proof  of  Lemma  4, 
each  of  tlu'se  rows  has  at  most  llm  nonzeros.  Thus  the  communication  requirement  of 
each  partition  is  at  most  Urn  •  7m  j and  the  total  communication  requirement  t)f  the 
triangular  block  is  bounded  above  by  llv/2m^  •  yjp. 

Now  consider  the  data  traffic  associated  with  the  off-diagonal  blocks.  Each  partition  is 
of  size  6m I y/p  x  mf^/p.  Thus,  each  partition  requires  nonzero  elements  frt>m  6m / ^/p  rows 
which  are  below  the  row  high{T)  in  the  factor.  From  the  proof  of  Lemma  4,  each  of  these 
rows  has  at  most  7m  nonzoros  that  are  useful  in  completing  the  computations  in  any  of  the 
partitions.  Each  partition  also  requires  information  from  mj  ^  rows  from  the  region  /o7/’(F) 
through  high{T).  Each  of  these  rows  has  at  most  llm  nonzeros.  Thus,  the  communication 
requirement  of  each  partition  is  at  most  7m  •  6m/^/p  +  llm  •  m f  y/p  =  j y/p  and  the 
total  communication  requirement  of  completing  the  computations  at  the  off-diagonal  blocks 
using  p  proces.sors  is  less  than  or  equal  to  53m^y/p. 

Adding  the  com.munication  costs  corresponding  to  the  diagonal  and  the  off-diagonal  blocks 
we  get  the  total  data  traffic  associated  with  F  to  be  less  than  or  equal  to  (53  4  11  \/7)m^  ■  ^/p. 

j  A  similar  analysis  can  be  used  to  compute  the  data  traffic  when  F  is  an  m-vertex  vertical 

sub-separator  and  can  be  shown  to  be  bounded  above  by  (28  4  8\/2)m^  ■  y/p.  I 


The  total  data  traffic  of  the  sparse  BLOCC  scheme 


Applying  the  results  from  the  above  lemma,  a  bound  is  obtained  on  the  total  data  traffic  of 
the  sparse  BLOCC  scheme.  First  some  notation  is  introduced.  Let  Th{m,p,  k)  represent  the 
data  traffic  using  p  processors  in  completing  the  computations  at  all  the  nonzero  elements 
lij  €  T  in  the  columns  corresponding  to  an  m-vertex  horizontal  sub-separator  that  is 
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slirrouluiod  hv  higher  erdorod  vorticos  on  k  side?.  Lot  Ty[jn ,  p,  k)  roprosont  the  same  for  an 
rr?  vrrtox  vortical  suL  soparatdr.  Fmni  Lomina  5.  r/,(7n.p.  4)  is  at  most  (53  +  1 1  v/2)m^  • 
and  T,  [m.  p.  4)  is  at  most  (28  +  8\/2)m^  ■  y/p.  Lot  Tg{in',p,  k)  roprosont  tho  total  data  traffic, 
using  p  procossors.  in  completing  tho  computations  corresponding  to  all  tho  sub-separators 
within  an  m'-vortox  sub  grid  that  is  surrounded  by  higher  ordered  vorticos  on  k  sides.  Note 
that  tlio  qiiantitio.*;  t/,  and  r,.  rojirosont  tho  data  traffic  corresponding  to  tho  vorticos  on  a 
h"ri/ontal  and  a  vortical  sub  separator,  rospoctivoly,  whereas  Tg  represents  the  data  traffic 
(■(irre^pMuding  to  the  vertices  (m  an  entire  sub-grid. 

Tlu'  bdh'wing  thc'rcm  gives  an  upper  bound  on  the  total  data  traffic  in  factoring  tho 
matrix  A  assneiated  with  an  ii  vertex  2  D  regular  grid  graph  using  r?'*  processors  with  the 
scheduling  scheme  as  described  alxive. 


Theorem  4  Tim  total  data  traffic  in  factnrtnq  the  v  x  n  fparse  matrix  .4,  using  n'' 
prorcsiors.  is  (^1  n' i.e.,  Tg{n  .  ii"  ,0)  — 


Prorf:  (fn  an  x  regular  grid  there  is  an  n'^^-ver'ex  vertical  sub-separator 

and  two  n '' ^/2  vertex  hori7ontal  suli-separators  (ignoring  the  additive  constant  -1).  The 
vertical  sub-separator  is  not  surrounded  by  any  vertices  that  are  ordered  after  the  vertices 
on  the  vertical  sub-separator.  Each  of  the  two  horizontal  sub-separators  arc  surrounded 
by  such  vertices  only  on  one  side.  These  three  sub-separators  subdivide  the  n-vertex  grid 
graph  into  four  sub-grids  of  sire  n'^^/2  x  n'^^/2,  each  surrounded  on  two  sides  by  higher 
ordered  vertices.  Thus,  the  total  data  traffic  in  factoring  the  corresponding  matrix  ,4  is 
given  by, 

rg(n,  n'’,0)  =  r,(^‘/^  ti“,0)  +  ^n",  1)  +  4rp(^n, 

A  recursive  expansion  of  the  above  expression  contains  data  traffic  terms  for  vertical  sub¬ 
separators  of  different  sires  that  are  surrounded  on  zero  sides,  two  sides,  three  sides  (in  two 
different  ways),  and  on  all  four  sides  by  higher  ordered  vertices.  It  also  contains  data  traffic 
expressions  for  horizontal  sub-separators  of  different  sizes  surrounded  in  five  different  ways. 
To  keep  the  analysis  simple,  it  is  assumed  that  all  the  four  sub-grids  of  size  n'/^/2  x  n'/^/2 
are  surrounded  on  all  four  sides.  This  simplification  results  in  a  conservative  expres.sion  for 
the  data  traffic,  but  affects  only  the  constant  terms  in  the  bound.  Thus, 


Now, 


r„(n,  n'’,0)  <  r„(n'/^,n“,0)  +  2rfc(^n'^^,  +  4rg(^Ti,  |■n“,4). 


LpCt"'  7n",4)  +  2r^(Jn'/^  ^"".4)  + 


4  4 

From  Lemma  5,  it  follows  that, 

1  1 


16  16 


134  +  85v/2 


,l+a/7 


16 
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An  analysis  similar  to  that  given  in  Lemma  5  yields 


and 


Thus,  we  get 


<  8\/2  • 


'u  2  ’2  !  ~  g 


(  o  r,-.  ^  +  71\/2  i+o/2 

Tg{n.n  .0)  <  - — - V  I  . 


3.6  A  lower  bound  on  the  data  traffic  complexity 


In  the  following  theorem,  the  communication  bound  of  the  sparse  BLOCC  scheme  is  shown, 
by  giving  a  lower  bound  prtjof,  to  be  optimal  in  an  order  of  magnitude  sense. 


Theorem  5  Under  the  condition  of  uniform  load  dixtribntion,  the  data  traffic  in  factoring 
the  n  X  n  sparse  matrix  A,  using  n®  processors,  o  <  1,  is 

Proof:  For  a  regular  2-D  grid  graph  with  n  vertices,  the  separator  sire  for  nested  dissection 
ordering  is  [12].  From  Lemma  3,  it  follows  that  the  factor  L  has  an  x  dense 
triangular  diagonal  block  incorporated  in  it.  IVom  Theorem  2,  the  data  traffic  involved  in 
completing  the  computations  associated  with  the  elements  of  this  dense  triangular  block, 
under  the  condition  of  uniform  load  distribution  using  n®  processors,  is  f?(n''^®/^).  Since 
the  factorization  of  A  cannot  be  completed  without  completing  the  factorization  of  this 
dense  block,  the  result  follows.  B 


From  Theorem  4  and  Theorem  5,  it  is  clear  that  the  load  assignment  scheme  described 
here  for  factoring  the  n  x  n  sparse  matrix  using  n®  processors  is  optimal  in  an  order  of 
magnitude  sense.  Note  that  when  n®,  a  >  1,  processors  are  used,  the  data  traffic  bound 
given  in  Theorem  3  holds. 


4.  Concluding  remarks 


In  this  paper  we  have  analyzed  the  data  dependencies  in  the  Cholcsky  factorization  of 
dense  and  sparse  .symmetric,  po.sitive  definite  matrices.  The  model  of  computation  assumes 
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a  nniltijirorossor  system  with  a  memory  hierarchy.  Based  on  this  analysis  it  is  shown  (hat 
under  the  eondition  of  uniform  load  distrihntion  the  data  traffic  associated  with  factoring 
an  V  X  V  dense  matrix  is  when  n",  n  <  2,  processors  are  used.  The  same 

is  shown  to  he  Q{ji'  '^'^1'^).  o  <  1,  for  factoring  an  n  x  n  sparse  matrix  representing  a 
2-dimensi('nal  regular  grid  graph  where  the  vertices  are  ordered  using  the  nested  dissection 
ordering  mc'thods.  Block  based  partitioning  schemes  are  presented  that  asym  "'totically 
achieve  these  bounds  on  the  data  traffic. 

The  sequential  comimtation  time  for  factoring  the  v  x  v  sj)arsc  matrix  A  is  829n,^/^/84  + 
()[n  log  7?)  [6].  .4s  stated  for  the  dense  matrix  case,  the  assumption  here  is  that  the  compu¬ 
tation  cost  of  each  ste{)  of  the  innermost  loop  is  one  and  C(7sts  involved  in  the  other  steps  are 
ignored.  Under  the  same  assumption,  it  can  be  shown  that  the  computation  time  for  the 
sjiarse  BLOCC  scheme  is  at  most  283n^'^^“"/4  if  n"  processors  are  used.  In  [7],  a  parallel 
scheme  for  factoring  the  matrix  .4  on  a  multiprocessor  system  is  given  that  is  analogous 
to  the  wrap  around  assignment  scheme  described  in  the  Section  2.5  for  dense  matrices. 
This  scheme  has  the  property  of  distributing  the  work  evenly  among  the  processors.  The 
computation  time  to  factor  the  sparse  matrix  A  on  n"  processors  with  the  wrap  around 
scheme  is  at  most  However,  the  data  traffic  associated  with  that  scheme  is 

less  than  or  equal  to  183n'‘''“/4.  Note  that  the  difference  in  the  computation  time  with 
the  BLOCC  scheme  and  with  the  wrap  around  assignment  scheme  is  loss  than  a  factor  of 
two.  The  BLOCC  scheme  is  able  to  compute  the  factor  efficiently  in  the  case  of  the  sparse 
matrices  because  the  processors  are  now  assigned  blocks  in  a  wrap  around  fashion  which 
tends  to  distribute  (he  load  evenly.  On  the  other  hand,  the  data  traffic  associated  with  the 
BLOCC  scheme  is  an  order  of  magnitude  less  than  that  for  the  wrap  around  assignment 
scheme.  Moreover,  in  the  former  scheme,  as  many  as  n  processors  may  be  used  before  the 
total  data  traffic  reaches  the  maximum  value  of  0(n®/^),  whereas  in  the  later  scheme  only 
up  to  processors  may  be  used  efficiently.  The  impbc'*'-  s  of  the  reduced  data  traffic 
on  the  performance  are  as  follows. 

The  sparse  BLOCC  scheme  reduces  the  communication  requirement  to  by  re¬ 

moving  the  constraint  of  column-level  indivisibility.  Here  the  indivisible  work  unit  is  the 
computation  corresponding  to  a  nonzero  element  in  the  factor  The  reduction  in  the  com¬ 
munication  requirements  is  brought  about  by  improving  the  utilization  of  the  data  accessed 
from  shared  memory  by  each  processor.  Consider  the  factorization  of  an  m  x  m  dense 
matrix.  Let  the  data  utilization  of  a  data  element  accessed  by  a  processor  be  defined  as 
the  number  of  computations  in  which  that  element  is  used  by  that  processor  divided  by 
m.  Since  an  element  in  the  factor  is  needed  in  at  most  m  computations,  the  maximum 
utilization  of  any  data  acces.sed  is  one.  Let  the  aggregate  data  utilization  fot  a  processor  be 
defined  as  the  average  utilization  of  the  individual  data  elements  accessed  by  that  proces¬ 
sor.  In  the  BLOCC  scheme  applied  to  an  m  x  m  dense  matrix,  each  processor  accesses 
at  most  2m^/yp  elements  from  the  shared  memory  and  each  element  is  used  in  at  least 
ml y/f  computations.  Thus,  the  utilization  of  each  data  accessed  is  at  least  and  so  is 

the  aggregate  utilization  of  all  the  data  accesses.  On  the  other  hand,  with  the  column-level 
work  assignment  scheme,  each  processor  accesses  0{m?)  elements  from  the  shared  memory. 
Of  these,  only  0{mlp)  elements  have  a  utilization  of  one  and  the  data  utilization  for  the 
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rrmaining  olomonts  is  l/p  which  gives  an  aggregate  data  utilization  of  approximately  1/p. 
Similar  improvements  in  data  utilizations  are  obtained  in  factoring  a  sparse  matrix. 

It  should  be  noted  that  the  square  shape  of  the  submatrix  partitions  produce  the  best 
possible  aggregate  utilizations.  For  the  algorithm  considered  here,  the  data  dependencies 
are  such  that  rectangular  and  square  partitions  give  rise  to  high  data  utilizations.  Since 
the  square  partitions  have  the  minimum  perimeter  for  a  given  area,  the  number  of  data 
elements  accessed  (which  is  proportional  to  the  perimeter  of  the  partition)  for  a  given 
work  load  (which  is  proportional  to  the  area  enclosed),  is  also  a  minimum  for  the  square 
partitions. 

An  effect  of  the  improvement  in  the  aggregate  utilization  of  data  and  the  resulting  reduction 
in  the  communication  requirements  is  the  segregation  of  the  accesses  to  the  shared  data. 
Since  the  total  data  traffic  in  factoring  an  m  x  m  dense  matrix  using  p  processors  is 
0{m^  ■  y/p),  on  an  average  each  processor  accesses  only  I y/p)  data.  Note  that  the 

total  shared  data  is  0{m^).  Thus,  on  an  average  each  element  in  the  shared  memory  is 
accessed  by  0(y/p)  processors.  The  column-level  assignment  scheme,  however,  has  a  total 
data  traffic  of  0(ni^  ■  p)  and  thus,  on  an  average  each  processor  accesses  0{m})  data  or  on 
an  average  each  element  in  the  shared  memory  is  accessed  by  0{p)  processors.  An  obvious 
implication  of  this  observation  is  that  for  the  scheme  presented  here,  not  only  is  the  total 
data  traffic  reduced  but  also  the  requests  at  individual  shared  addresses.  This  can  have 
considerable  impact  on  the  performance  of  the  systems  with  a  large  number  of  processors. 

As  a  final  remark,  note  that  the  data  traffic  analysis  for  the  sparse  BLOCC  scheme  exploits 
the  fact  that  the  underlying  graph  satisfies  a  y^-separator  theorem.  Thus,  similar  schemes 
may  be  developed  for  any  class  of  graphs  satisfying  an  /(re)-separator  theorem  [13].  In 
such  cases  the  data  dependencies,  the  fill,  and  the  computation  time  depend  on  /(n).  In 
[12]  the  fill  and  the  bounds  on  the  sequential  computation  time  for  various  values  of  f{n) 
arc  listed.  Here  we  state  the  bounds  on  the  corresponding  data  traffic  when  the  systems 
are  computed  using  n"  processors.  The  data  traffic  of  factoring  a  matrix  corresponding  to 
an  n- vertex  3-dimensional  regular  grid  using  n°  processors  is  For  that  case 

the  computation  time  is  0(n^~“).  In  general,  the  total  data  traffic  for  computing  a  factor 
of  a  matrix  corresponding  to  a  d-dimensional  grid  is  where  tr  —  \  -  1/d.  The 

computation  cost  is  For  an  n- vertex  finite  element  graph^  with  no  element  having 

more  than  k  boundary  vertices,  the  total  data  traffic  in  factoring  the  associated  matrix  is 
0{k^  ■  The  computation  time  is  0{k^  •  All  these  quantities  are  optimal  in 

an  order  of  magnitude  sense. 


*A  finite  element  graph  is  any  graph  formed  from  a  planar  embedding  of  a  planar  graph  by  adding  all 
possible  diagonals  to  each  fare;  i.e.,  there  is  a  clique  corresponding  to  each  face  of  the  embedded  planar 
graph. 
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